Data Overview

Data description and analysis purposes

The dataset used, sourced from Kaggle, provides detailed information about the top companies in the Forbes 2022 ranking.

Specifically, the information we have pertains to:

  • Rank: the position of each company in the ranking of the richest companies.
  • Global Company: the full name of each company, which uniquely identifies it.
  • Country: the country in which the company is headquartered.
  • Market Value: the company's market capitalization in 2022.
  • Profits: the net profits generated by the company during the reference period.
  • Sales: the company's total revenue in 2022.
  • Assets: the total value of the company's assets.

The ultimate goal of this project is not to make predictions, but rather to explore the relationship between the predictors and the response variable. In more detail, we will focus on understanding the main factors that influence the market value of a company.

Preprocessing

The preprocessing operations include renaming the variables "Global Company" and "Market Value" to "global_company" and "market_value" for simplicity, converting the quantitative variables from character to numeric, and checking for missing values: fortunately, none are found. Additionally, since the monetary figures are expressed in different units of measurement, some in billions and others in millions of dollars, we chose to convert everything to billions.

The last modification involved the introduction of a new column labeled "continent." This change was made because working with the "country" variable, which contained 57 different countries, proved somewhat unwieldy. By categorizing companies into continents, we reduce the number of distinct values to just 6, which improves the clarity and efficiency of the subsequent analysis.
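
A minimal Python/pandas sketch of these steps (the report's own code is in R; the column names, unit suffixes, and the continent mapping below are assumptions for illustration only):

```python
import pandas as pd

# Toy frame mimicking the raw data: mixed units ("B" / "M") stored as text.
df = pd.DataFrame({
    "Global Company": ["Berkshire Hathaway", "ICBC"],
    "Country": ["United States", "China"],
    "Market Value": ["741.48 B", "214430 M"],
})

# Rename the variables for simplicity
df = df.rename(columns={"Global Company": "global_company",
                        "Market Value": "market_value",
                        "Country": "country"})

def to_billions(x: str) -> float:
    """Convert a '<number> B' or '<number> M' string to billions of dollars."""
    value, unit = x.split()
    value = float(value.replace(",", ""))
    return value if unit == "B" else value / 1000

df["market_value"] = df["market_value"].map(to_billions)

# Collapse 57 countries into 6 continents (toy mapping for the two rows above)
continent_map = {"United States": "North America", "China": "Asia"}
df["continent"] = df["country"].map(continent_map)
```

The same logic applied to the full dataset yields the table shown below.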

This is how our dataset appears:

rank  global_company                             country         sales    profit    assets   market_value  continent
1     Berkshire Hathaway                         United States   276.09    89.80    958.78        741.48   North America
2     ICBC                                       China           208.13    54.03   5518.51        214.43   Asia
3     Saudi Arabian Oil Company (Saudi Aramco)   Saudi Arabia    400.38   105.36    576.04       2292.08   Asia

Exploratory Data Analysis

Before we create the model, it’s crucial to take a close look at the dataset so that we can understand how the different variables relate to each other. This will help us decide which ones are most important for our final model.

Quantitative Variables

Let’s start by analyzing the distribution of quantitative variables.

As the literature suggests, monetary variables are expected to exhibit high variance. To address this issue, we apply a logarithmic transformation.

Indeed, the logarithmic transformation offers significant advantages for financial data analysis: it reduces the variance of the data, making them more stable; it promotes linearity in statistical models, simplifying the modeling of complex relationships; and it helps align the data with a normal distribution, which is a critical assumption in many statistical analyses.

As intended, after the transformation the variables show much smoother distributions.

From this point forward, we will use the log-transformed data.
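
As a quick illustration of the variance-reduction argument, here is a Python sketch on synthetic right-skewed data (not the actual Forbes figures):

```python
import numpy as np

# A right-skewed "monetary" variable: lognormal draws, as a stand-in for
# market values in billions (synthetic, for illustration only).
rng = np.random.default_rng(12)
market_value = rng.lognormal(mean=3, sigma=1.5, size=1000)

log_mv = np.log(market_value)

# Coefficient of variation (spread relative to the mean): the log transform
# compresses the extreme values and stabilizes the scale.
cv_raw = market_value.std() / market_value.mean()
cv_log = log_mv.std() / abs(log_mv.mean())
```

On these draws `cv_raw` is several times larger than `cv_log`, mirroring the stabilization we observe in our monetary variables.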

Qualitative Variable

In addition to quantitative variables, our dataset also includes a country variable, consisting of 57 unique values representing the 57 countries under consideration. Let’s explore how the geographic location of companies influences their market value.

What immediately stands out is that the companies with the highest market value are located in the U.S., followed by Canada, Australia, and some European and Asian countries. Conversely, companies with lower market values tend to be situated primarily in Africa and South America. These significant differences between countries suggest that it might also be useful to include geographic location as a variable in our model.

However, considering the large number of countries in the dataset, handling numerous parameters for estimation could pose challenges in terms of interpretation and computational simplicity. To address this, we will analyze the data at the level of continents. It’s worth noting that we have divided the continent of America into North America and South America due to variations in market value observed between companies located in the northern and southern regions.

The patterns observed at the country level are clearly mirrored when examining market value on a continent-wide scale.

Outliers

Another aspect worth investigating to enhance the precision of our model is the presence of outliers.

Due to the presence of numerous outliers, we restrict our attention to the observations that fall between the 5th and 95th percentiles of the response variable.

The situation improves slightly. However, the conclusion from this outlier analysis is that it is not the same companies that have anomalous values across all variables: a company that appears as an outlier for its market value is not the same one that appears as an outlier for its profits, and this pattern holds for all other variables as well. Therefore, we remove only the outliers in market value.
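
The trimming rule can be sketched as follows (synthetic values for illustration; the report applies the same rule to market_value in R):

```python
import numpy as np

# Keep only observations whose response value falls between the 5th and
# 95th percentiles; the extreme tails are dropped.
market_value = np.array([0.1, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lo, hi = np.percentile(market_value, [5, 95])
kept = market_value[(market_value >= lo) & (market_value <= hi)]
```

Here the two extreme values (0.1 and 100.0) are removed while the central observations survive.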

Correlation

This correlation matrix suggests that profit exhibits a quite strong positive correlation with the market value of companies, in contrast to assets, which appear to be the least correlated variable. As for sales, there is a moderate correlation. However, although the correlation between the independent and response variables is a good indicator of a potential relationship, it is not sufficient to determine which variables to include or exclude from the model. Therefore, it would still be reasonable to start with a complete model that includes all variables.

Scatter Plot

The last aspect we examine before selecting the final model is the relationship between the response variable and the quantitative variables, visualized through scatterplots.

Upon observing the scatterplots, it becomes evident that each quantitative variable exhibits a more or less strong linear relationship with the "market value" variable. This reaffirms our initial observations from the correlation matrix.

Having thoroughly analyzed the data and observed a linear relationship with the response variable, it seems appropriate to employ a linear model.

Frequentist vs Bayesian Linear Regression

Linear regression is a statistical technique widely used to model the relationship between a dependent variable \(Y\) and one or more independent variables \(X_{1},...,X_{n}\). In traditional linear regression, the goal is to estimate the model parameters as point estimates. The fundamental assumption is that the response variable \(y\) is a linear combination of a set of predictor variables \(X\) and weights \(\beta\), plus a random error component \(\epsilon\):

\(y = \beta^{T}X + \epsilon\)

\(\epsilon \sim N(0, \sigma^2)\)

The determination of the optimal weights, denoted as \(\hat{\beta}\), centers around minimizing the Residual Sum of Squares (RSS). This optimization technique, commonly known as Ordinary Least Squares (OLS), aims to minimize the sum of squares of the differences between the observed \(y\)-values and the predicted \(\hat{y}\)-values and provides a closed-form solution given by the equation:

\(\hat{\beta}=(X^{T}X)^{-1}X^Ty\)
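
This closed-form solution can be verified numerically; a short sketch on synthetic data (the report's actual fits use R's lm()):

```python
import numpy as np

# Synthetic regression problem: intercept plus two predictors.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([2.0, 0.6, -0.1])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = (X^T X)^{-1} X^T y, solved via the normal equations
# (np.linalg.solve avoids forming the explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

The normal-equations solution coincides with the least-squares solver's answer and recovers the generating coefficients up to sampling noise.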

However, this method lacks a natural way to account for uncertainty in these estimates.

Bayesian linear regression, on the other hand, offers a powerful alternative by treating model parameters as probability distributions.

The Bayesian approach drops the fundamental assumption of the frequentist approach, namely that the data alone contain sufficient information to accurately estimate the coefficients. Instead, information is added through a prior probability distribution over the parameters to be estimated: we no longer rely on a point estimate of the parameters but integrate uncertainty into the model in the form of a probability distribution.

More precisely, we maintain the linear relationship between predictors and their corresponding weights, but in this case, we regard the response variable as:

\(Y\sim N(\beta^{T}X,\sigma^{2}I)\)

Thus, the goal of this approach is to identify the posterior distribution of weights based on the prior probability distribution assigned to them.
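
In the simplified conjugate case of a Gaussian prior on the weights with known noise precision, this posterior is available in closed form. A sketch on synthetic data (illustrative values only; the report instead samples the posterior with JAGS):

```python
import numpy as np

# Synthetic data: intercept plus one predictor.
rng = np.random.default_rng(12)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])
y = X @ np.array([1.848, 0.2]) + rng.normal(scale=0.66, size=n)

tau = 2.31   # noise precision, here treated as known
tau0 = 0.1   # prior precision on the weights, mirroring dnorm(0, 0.1)

# Posterior over beta: N(mu_n, S_n) with
#   S_n = (tau0 * I + tau * X^T X)^{-1},  mu_n = tau * S_n X^T y
S_n = np.linalg.inv(tau0 * np.eye(2) + tau * X.T @ X)
mu_n = tau * S_n @ X.T @ y
```

With this much data the likelihood dominates and the posterior mean lands close to the generating coefficients, while S_n quantifies the remaining uncertainty.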

But how do we choose the correct priors?

Usually, there are two options to consider. The first involves parameter-specific prior distributions, possibly drawing on prior knowledge from the existing literature or from specialized papers on the subject. The second, extensively discussed by Andrew Gelman, uses general-purpose prior distributions, commonly referred to as uninformative priors.

Because the literature offers no adequate guidance for this decision, weakly informative priors were chosen, as Gelman's paper recommends when "there's a reasonably large amount of data, so the likelihood will dominate, and the prior will not be so important".

After this brief introduction, we have gathered the essential information required to construct and compare our models.

Recall that in constructing our models we use log-transformed, outlier-free data to strive for the most accurate estimation possible.

Frequentist Approach

Model 1: complete

The first model includes all the variables.

mod1 = lm(market_value ~ sales + profit + assets + continent, data = dati_log)
summary(mod1)
## 
## Call:
## lm(formula = market_value ~ sales + profit + assets + continent, 
##     data = dati_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2140 -0.5731  0.0259  0.5453  4.6841 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.34765    0.22684  10.350  < 2e-16 ***
## sales                   0.15170    0.02048   7.406 1.95e-13 ***
## profit                  0.61204    0.02191  27.938  < 2e-16 ***
## assets                 -0.07557    0.01797  -4.205 2.74e-05 ***
## continentAsia          -0.10525    0.21795  -0.483  0.62923    
## continentEurope         0.14815    0.22025   0.673  0.50125    
## continentNorth America  0.60150    0.21897   2.747  0.00607 ** 
## continentOceania        0.47883    0.27351   1.751  0.08016 .  
## continentSouth America -0.10832    0.26415  -0.410  0.68180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8887 on 1868 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.5012 
## F-statistic: 236.6 on 8 and 1868 DF,  p-value: < 2.2e-16

From an initial analysis of this model, what is striking is that the 'continent' variable is largely non-significant: only the North America level reaches significance. Therefore, let's create a second model that considers only the quantitative variables, and then compare the results.

Model 2: quantitative variables

Let’s include only the quantitative variables.

mod2 = lm(market_value ~ sales + profit + assets, data = dati_log)
summary(mod2)
## 
## Call:
## lm(formula = market_value ~ sales + profit + assets, data = dati_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7590 -0.5983  0.0184  0.6265  4.5613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.65985    0.07594  35.024  < 2e-16 ***
## sales        0.15542    0.02157   7.205 8.36e-13 ***
## profit       0.66752    0.02274  29.359  < 2e-16 ***
## assets      -0.11543    0.01871  -6.168 8.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9374 on 1873 degrees of freedom
## Multiple R-squared:  0.4459, Adjusted R-squared:  0.445 
## F-statistic: 502.3 on 3 and 1873 DF,  p-value: < 2.2e-16

To compare the results, we consider various key metrics to assess their goodness.

Regarding the Residual Standard Error, both models have relatively low values, although the first model's is slightly lower (0.889 vs. 0.937), suggesting a slightly better fit to the data.

Moving on to the R-squared, the first model has higher R-squared and Adjusted R-squared values, indicating a better ability to explain the variation in the response variable. However, this result was expected, as R-squared tends to decrease when the number of predictors used is reduced.

Finally, looking at the F-statistic, the second model shows a considerably higher value. Note, however, that F-statistics from models with different numbers of predictors are not directly comparable, since the degrees of freedom differ; both models are globally significant, with p-values below 2.2e-16.

Certainly, the choice of the best model depends on the objectives and the preference for model complexity. To get a clearer picture of the situation, let’s repeat the analysis using the Bayesian approach.

Bayesian Approach

Model 1: complete

# Step 1: Define the model 
# First model: log-transformed quantitative variables + the 'continent'
# dummies, using the dataset without outliers in the response market_value
set.seed(12)

lm1 = function(){
  
    # Likelihood:
    for (i in 1:N){
      
    market_value[i] ~ dnorm(mu[i], tau)
      
    mu[i] = alpha + beta_profit * profit[i] + beta_sales * sales[i] + beta_assets * assets[i] + beta_asia * continentAsia[i] + beta_europe * continentEurope[i] + beta_northAmerica * continentNorthAmerica[i] + beta_oceania*continentOceania[i] + beta_southAmerica * continentSouthAmerica[i] }
  
  # Priors:
  alpha ~ dnorm(0, 0.1)
  beta_profit ~ dnorm(0, 0.1)
  beta_sales ~ dnorm(0, 0.1)
  beta_assets ~ dnorm(0, 0.1)
  beta_asia ~ dnorm(0, 0.1)
  beta_europe ~ dnorm(0, 0.1)
  beta_northAmerica ~ dnorm(0, 0.1)
  beta_oceania ~ dnorm(0, 0.1)
  beta_southAmerica ~ dnorm(0, 0.1)
  
  tau ~ dgamma(0.01, 0.01)
  sigma = 1/sqrt(tau)
}

# Step 2: Choose the parameters to save
par1 <- c("alpha",
  "beta_profit",
  "beta_sales", 
  "beta_assets",
  "beta_asia",
  "beta_europe",
  "beta_northAmerica",
  "beta_oceania",
  "beta_southAmerica",
  "tau")

# Step 3: Define the data 
data1 <- list(
  N = nrow(data_filtrato),
  market_value = data_filtrato$market_value,
  profit = covariates1[, "profit"],
  sales = covariates1[, "sales"],
  assets = covariates1[, "assets"],
  continentAsia = covariates1[, "continentAsia"],
  continentEurope = covariates1[, "continentEurope"],
  continentNorthAmerica = covariates1[, "continentNorthAmerica"],
  continentSouthAmerica = covariates1[, "continentSouthAmerica"],
  continentOceania = covariates1[, "continentOceania"]
)

# Step 4: Run the model 
fit1 <- jags(
  data = data1, 
  inits = NULL, 
  parameters.to.save = par1, 
  model.file = lm1,
  n.chains = 3, 
  n.iter = 10000, 
  n.burnin = 1000, 
  n.thin = 1, 
  DIC = TRUE,
  quiet = TRUE
)

fit1
## Inference for Bugs model at "C:/Users/ludom/AppData/Local/Temp/RtmpUHF3lu/model920306d3de8.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded)
##  n.sims = 27000 iterations saved
##                    mu.vect sd.vect     2.5%      25%      50%      75%    97.5%
## alpha                1.819   0.198    1.429    1.686    1.819    1.953    2.203
## beta_asia           -0.017   0.190   -0.388   -0.145   -0.019    0.111    0.353
## beta_assets          0.117   0.016    0.086    0.107    0.117    0.128    0.148
## beta_europe          0.005   0.191   -0.369   -0.123    0.005    0.133    0.376
## beta_northAmerica    0.053   0.190   -0.321   -0.074    0.052    0.182    0.425
## beta_oceania         0.077   0.235   -0.384   -0.082    0.077    0.235    0.535
## beta_profit          0.207   0.020    0.167    0.193    0.207    0.220    0.246
## beta_sales           0.144   0.019    0.108    0.132    0.144    0.157    0.180
## beta_southAmerica   -0.050   0.224   -0.486   -0.202   -0.050    0.103    0.387
## tau                  2.310   0.084    2.148    2.253    2.310    2.367    2.479
## deviance          3003.712   4.473 2996.942 3000.472 3003.069 3006.219 3014.209
##                    Rhat n.eff
## alpha             1.001 27000
## beta_asia         1.001 27000
## beta_assets       1.001 15000
## beta_europe       1.001 27000
## beta_northAmerica 1.001 27000
## beta_oceania      1.001 27000
## beta_profit       1.001 10000
## beta_sales        1.001 27000
## beta_southAmerica 1.001 27000
## tau               1.001 27000
## deviance          1.001  5100
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 10.0 and DIC = 3013.7
## DIC is an estimate of expected predictive error (lower deviance is better).

Although the chains appear to have converged, we immediately notice that the parameters for the 'continent' dummies are very close to zero and their 95% credible intervals all include zero, indicating their lack of relevance to the model. Therefore, let's construct a second model that considers only the quantitative variables, following the same approach as in the frequentist analysis.

Model 2: quantitative variables

# Step 1: Define the model 
# Second model: log-transformed quantitative variables only, using the
# dataset without outliers in the response market_value
set.seed(12)
lm2 = function(){
  
    # Likelihood:
    for (i in 1:N){
      
    market_value[i] ~ dnorm(mu[i], tau)
      
    mu[i] = alpha + beta_profit * profit[i] + beta_sales * sales[i] + beta_assets*assets[i]}
  
  # Priors:
  alpha ~ dnorm(0, 0.1)
  beta_profit ~ dnorm(0, 0.1)
  beta_sales ~ dnorm(0, 0.1)
  beta_assets ~ dnorm(0, 0.1)
  
  tau ~ dgamma(0.1, 0.1)
  sigma = 1/sqrt(tau)
}

# Step 2: Choose the parameters to save
par2 <- c("alpha",
  "beta_profit",
  "beta_sales", 
  "beta_assets",
  "tau")

# Step 3: Define the data 
data2 <- list(
  N = nrow(data_filtrato),
  market_value = data_filtrato$market_value,
  profit = covariates2[, "profit"],
  sales = covariates2[, "sales"],
  assets = covariates2[, "assets"])

# Step 4: Run the model 
fit2 <- jags(
  data = data2, 
  inits = NULL, 
  parameters.to.save = par2, 
  model.file = lm2,
  n.chains = 3, 
  n.iter = 10000, 
  n.burnin = 1000, 
  n.thin = 1, 
  DIC = TRUE,
  quiet = TRUE
)

fit2
## Inference for Bugs model at "C:/Users/ludom/AppData/Local/Temp/RtmpUHF3lu/model92042d036b6.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded)
##  n.sims = 27000 iterations saved
##              mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
## alpha          1.848   0.064    1.721    1.805    1.848    1.892    1.975 1.001
## beta_assets    0.113   0.015    0.083    0.102    0.113    0.123    0.143 1.001
## beta_profit    0.213   0.020    0.174    0.200    0.213    0.227    0.252 1.001
## beta_sales     0.143   0.019    0.107    0.131    0.143    0.156    0.180 1.001
## tau            2.311   0.085    2.149    2.253    2.311    2.368    2.482 1.001
## deviance    3002.268   3.186 2998.088 2999.933 3001.609 3003.874 3010.230 1.001
##             n.eff
## alpha       27000
## beta_assets 27000
## beta_profit 27000
## beta_sales  27000
## tau         27000
## deviance     7200
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 5.1 and DIC = 3007.3
## DIC is an estimate of expected predictive error (lower deviance is better).

Looking at the DIC (3007.3 vs. 3013.7), it's evident that this second model is better than the previous one.

The DIC, or Deviance Information Criterion, is a statistical measure used in Bayesian model comparison and selection. It assesses the goodness-of-fit and predictive performance of competing models and is based on the concept of "deviance," a measure of how well a statistical model explains the observed data: lower deviance values indicate that a model fits the data better, capturing the underlying patterns and relationships more effectively.

The construction of the DIC takes into account two aspects:

  1. Model complexity (pD): the effective number of parameters, which penalizes models with excessive parameters that might overfit, while simpler models with fewer parameters are favored if they adequately explain the data.

  2. Predictive performance (the DIC itself): the DIC combines the mean deviance and pD to estimate the model's expected predictive error, i.e., how well the model is likely to perform on new, unseen data. Lower DIC values indicate better predictive performance.
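
The rule quoted in the jags output above (pD = var(deviance)/2) can be reproduced with a short sketch (the deviance draws here are synthetic, roughly matching the scale of our second model):

```python
import numpy as np

# Synthetic posterior deviance draws, standing in for the MCMC output.
rng = np.random.default_rng(12)
deviance = rng.normal(loc=3002.3, scale=3.2, size=27000)

# R2jags' rule: pD = var(deviance)/2, DIC = mean(deviance) + pD
pD = deviance.var() / 2
DIC = deviance.mean() + pD
```

With a deviance standard deviation near 3.2, pD comes out close to 5, in line with the pD = 5.1 reported for the second model.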

Of course, relying solely on the DIC is not enough to determine the actual goodness of a model. Let’s proceed to analyze the model diagnostics, which are useful for understanding how much we can trust the obtained estimates.

Diagnostics

Following with the analysis of the obtained output, we find:

R-hat

The Rhat, also known as the Gelman-Rubin statistic, is a measure used to check if the chains have converged sufficiently to estimate the posterior distribution accurately. An Rhat equal to 1 represents the ideal scenario: it indicates that the chains have converged to the same stationary distribution and that the parameter estimates are reliable. However, in general, Rhat values below a threshold of 1.05, as in our case, are widely accepted as an indicator of good convergence.
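
A minimal sketch of the basic Gelman-Rubin computation (a simplified version without split chains; the chains here are synthetic):

```python
import numpy as np

def rhat(chains: np.ndarray) -> float:
    """Basic Gelman-Rubin statistic for an array of shape (m chains, n draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_hat = (n - 1) / n * W + B / n      # pooled posterior-variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(12)
good = rng.normal(size=(3, 9000))               # three well-mixed chains
bad = good + np.array([[0.0], [2.0], [4.0]])    # chains stuck at different levels
```

For the well-mixed chains Rhat is essentially 1; for the offset chains the between-chain variance inflates it well above the 1.05 threshold.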

N.eff

The “n.eff” or effective sample size is a measure used to assess the quality of the (MCMC) samples generated during the estimation process. In Bayesian analysis, MCMC methods generate a sequence of samples that are not entirely independent due to the autocorrelation between consecutive samples: the n.eff helps quantify the magnitude of this autocorrelation and provides an estimate of how many independent samples you would need from a completely independent distribution to achieve the same level of precision.

As autocorrelation decreases, the effective sample size increases and the estimates become more precise.

In our case, the effective sample size is, for essentially every parameter, equal to the total number of saved iterations: indeed, the autocorrelation drops to practically zero already after the first lag.
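
The link between autocorrelation and effective sample size can be made concrete with the standard formula ESS = N / (1 + 2 Σ ρ_k); for an AR(1) chain with coefficient φ, the sum of autocorrelations is φ/(1 − φ). A small sketch (the function name is ours, for illustration):

```python
def ess_ar1(n: int, phi: float) -> float:
    """Effective sample size of an AR(1) chain with autocorrelation phi.

    Uses ESS = N / (1 + 2 * sum_k phi^k) = N / (1 + 2 * phi / (1 - phi)).
    """
    return n / (1 + 2 * phi / (1 - phi))

independent = ess_ar1(27000, phi=0.0)  # no autocorrelation: full sample size
sticky = ess_ar1(27000, phi=0.9)       # strong autocorrelation: far fewer
```

With zero autocorrelation the ESS equals the 27,000 saved draws, exactly the situation we observe; with φ = 0.9 the same chain would be worth fewer than 1,500 independent samples.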

Trace Plots

A traceplot is a graphical representation commonly employed in Markov Chain Monte Carlo (MCMC) analysis to visualize how chains behave when sampling statistical parameters. Its primary purpose is to track the evolution of these chains over time, providing insights into potential convergence problems.

The desired result is that the chains within each traceplot show stability (converging around a single value) and good mixing (all chains overlap closely around the same value).

What does this mean?

Convergence refers to the property whereby MCMC chains eventually stabilize in a stationary distribution. In other words, they reach a point where their values no longer change significantly from one iteration to the next. Stability means that the chains have actually explored and found the central tendency or most likely value of the parameter to be estimated.

Mixing, on the other hand, refers to how effectively the MCMC chains explore the entire parameter space: if the chains are well mixed, they move freely across different regions of the parameter space, which is essential for obtaining accurate estimates.

When all chains overlap around the same value, it means they have reached a common point of convergence and provide consistent estimates.

Looking at our results, we can see that all the properties described above are satisfied.

Density Plots

In addition to the traceplot, another valuable tool frequently employed is the density plot, which offers insights into the probability distribution of the estimated parameters.

When examining a density plot, if the peak of the curve is far from zero and is well-defined, it suggests that the parameter estimate is likely to be concentrated around that value with low uncertainty.

In this context, all three chains converge to an approximately normal distribution with a well-defined peak, confirming the significance of the relationships between the coefficients highlighted in the traceplots above. Moreover, no parameter's range of plausible values includes zero.

DIC comparison

The choice of the two models described above was the result of several attempts. Given the specific situation we encountered with the outliers and the asymmetric distribution of the monetary variables, the decision was made to use the dataset with logarithmic transformations and without outliers regarding the response variable.

We now refit the same models using both the original dataset and the dataset containing all observations (including outliers) to verify that the choices made were correct.

By comparing the DIC values of the models, we can see how the modifications made to the initial dataset have indeed proven to be correct.